Sentiment Analysis using Transfer Learning - v20.06.01
Author(s): TAMOGHNA SAHA

Sentiment classification is a perfect problem for getting started in Natural Language Processing (NLP). As the name suggests, it is the classification of people's opinions or expressions into different sentiments, such as Positive, Neutral, and Negative.

NLP is a powerful tool, but in the real world we often come across tasks that suffer from data deficits and poor model generalisation. Transfer learning addresses this problem. It is the process of training a model on a large-scale dataset and then using that pretrained model to conduct learning for another downstream task (i.e., the target task).

Importing libraries

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/tamoghna/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

All libraries are imported successfully!

Data Loading & Preprocessing

In this notebook, I am using Sentiment140. It contains two labeled datasets:

  • 1.6 million tweets to be used for the train/validation/test split
  • 498 tweets to be used as a separate, fresh test set

The data dictionary is as follows:

  • target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
  • ids: the id of the tweet (2087)
  • date: the date of the tweet (Sat May 16 23:58:44 UTC 2009)
  • flag: the query (lyx); if there is no query, this value is NO_QUERY
  • user: the user that tweeted (robotickilldozr)
  • text: the text of the tweet (Lyx is cool)

NOTE: The training data isn't perfectly categorised: it was created by tagging each text according to the emoji present. So any model built on this dataset may have lower than expected accuracy.
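A minimal sketch of loading the data with pandas. The Sentiment140 CSV ships without a header row, so we supply the column names from the data dictionary ourselves; here a two-row in-memory sample (an assumption, standing in for the real 1.6M-row file) keeps the snippet runnable:

```python
import io

import pandas as pd

# Column order follows the data dictionary; the real CSV has no header.
COLS = ["target", "id", "date", "flag", "user", "text"]

# In the actual notebook this would be pd.read_csv(<path to the Sentiment140
# csv>, names=COLS, encoding="latin-1"); a tiny sample stands in here.
sample = io.StringIO(
    '0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot - Awww"\n'
    '4,1467822272,Mon Apr 06 22:22:45 PDT 2009,NO_QUERY,ersle,"I LOVE my kindle"\n'
)
df = pd.read_csv(sample, names=COLS)
print(df.shape)  # (2, 6)
```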

DATA
target id date flag user text
0 0 1467810369 Mon Apr 06 22:19:45 PDT 2009 NO_QUERY _TheSpecialOne_ @switchfoot http://twitpic.com/2y1zl - Awww, t...
1 0 1467810672 Mon Apr 06 22:19:49 PDT 2009 NO_QUERY scotthamilton is upset that he can't update his Facebook by ...
2 0 1467810917 Mon Apr 06 22:19:53 PDT 2009 NO_QUERY mattycus @Kenichan I dived many times for the ball. Man...
3 0 1467811184 Mon Apr 06 22:19:57 PDT 2009 NO_QUERY ElleCTF my whole body feels itchy and like its on fire
4 0 1467811193 Mon Apr 06 22:19:57 PDT 2009 NO_QUERY Karoli @nationwideclass no, it's not behaving at all....

Let us explore the data for better understanding.





We will require only the target and text columns. As observed from the above report, we have only positive (4) and negative (0) sentiments. We will replace 4 with 1 for convenience.
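The column selection and relabelling amount to two lines of pandas; a sketch on a toy frame (column names follow the data dictionary):

```python
import pandas as pd

df = pd.DataFrame({
    "target": [0, 4, 4, 0],
    "text": ["bad day", "great day", "love it", "so sad"],
    "flag": ["NO_QUERY"] * 4,
})
df = df[["target", "text"]]                # keep only the columns we need
df["target"] = df["target"].replace(4, 1)  # map positive 4 -> 1
print(sorted(df["target"].unique()))       # [0, 1]
```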

Also, it's a perfectly balanced dataset without any skew - an equal distribution of positive and negative sentiment.

Number of categories: 2

Text Preprocessing

At first glance, it's evident that the data is not clean. Tweets often consist of user mentions, hyperlinks, emoticons and special characters which add no value as features to the model we are training, so we need to get rid of them. To do this, we perform four crucial steps, one by one:

  1. Hyperlinks and Mentions: On Twitter, people can tag/mention other users and share URLs/hyperlinks. Neither carries sentiment, so we eliminate both.

  2. Stopwords: These are commonly used words (such as “the”, “a”, “an”, “in”) which have no contextual meaning in a sentence, and hence we ignore them when indexing entries for searching and when retrieving them as the result of a search query.

  3. Spelling Correction: We can definitely expect incorrect spellings in the tweets, and we need to fix as many as possible, because otherwise the following step will not work properly.

  4. Stemming/Lemmatization: The goal of both stemming and lemmatization is to reduce inflectional and derivationally related forms of a word to a common base form. However, there is a difference, explained below.

Lemmatization is similar to stemming, with one difference - the final form is also a meaningful word. As a consequence, lemmatization needs a dictionary, while stemming does not. Here we will go ahead with lemmatization.

Steps 1, 2 and 4 can be done using the NLTK library, and the spell-checking in step 3 using pyspellchecker.

List of stop words:

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

Some words like not, haven't and don't are included in the stopword list, and ignoring them would make sentence pairs like this was not good / this was good, or He is a nice guy... not! / He is a nice guy... !, produce the same predictions. So we need to remove the words that express negation, denial, refusal or prohibition from the stopword list, so they survive preprocessing.
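Keeping the negation words is a simple set difference. A sketch with a shortened sample of NLTK's stopword list (the full list is shown above; the exact word sets are illustrative assumptions):

```python
# Shortened sample of NLTK's English stopword list.
stop_words = {"i", "the", "a", "an", "in", "no", "nor", "not",
              "don", "don't", "isn", "isn't", "won't"}

# Words expressing negation/denial/refusal that the model should keep.
negations = {"no", "nor", "not", "don", "don't", "isn", "isn't", "won't"}

stop_words = stop_words - negations  # updated stopword list
print(sorted(stop_words))
```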

Updated list of stop words:

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'if', 'or', 'because', 'as', 'while', 'of', 'at', 'about', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'on', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'should', "should've", 'd', 'll', 'm', 'o', 're', 've', 'y', 'ma', 'won']

Now, let us define the function to perform the necessary preprocessing.
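A minimal sketch of such a preprocess function: regex-based removal of hyperlinks, mentions and special characters, plus stopword filtering. The spell-correction and lemmatization steps would use pyspellchecker and NLTK's WordNetLemmatizer (which need downloaded corpora), so they are only indicated in comments to keep this self-contained; the sample stopword set is an assumption:

```python
import re

STOP_WORDS = {"i", "me", "is", "that", "a", "the", "to", "my"}  # sample list

def preprocess(text):
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # strip hyperlinks
    text = re.sub(r"@\w+", " ", text)                   # strip user mentions
    text = re.sub(r"[^a-zA-Z\s]", " ", text)            # strip special chars
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    # Real pipeline: correct spelling with pyspellchecker, then lemmatize
    # each token via nltk.stem.WordNetLemmatizer().lemmatize(token).
    return " ".join(tokens)

print(preprocess("@switchfoot http://twitpic.com/2y1zl - Awww, that is a bummer!"))
```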

Now, we will apply the preprocess function to each value of the text column, where the tweets are located.

DATA
target id date flag user text
0 0 1467810369 Mon Apr 06 22:19:45 PDT 2009 NO_QUERY _TheSpecialOne_ aww bummer shoulda got david carr third day em...
1 0 1467810672 Mon Apr 06 22:19:49 PDT 2009 NO_QUERY scotthamilton upset update facebook by texting might cry res...
2 0 1467810917 Mon Apr 06 22:19:53 PDT 2009 NO_QUERY mattycus dived many time for ball managed save 50 rest ...
3 0 1467811184 Mon Apr 06 22:19:57 PDT 2009 NO_QUERY ElleCTF whole body feel itchy like fire
4 0 1467811193 Mon Apr 06 22:19:57 PDT 2009 NO_QUERY Karoli no not behaving mad see

Let's take a quick look at the words that are frequently used for positive and negative tweets.
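Such frequency counts can be obtained with collections.Counter; a toy sketch (the tweet sample is an assumption, and the notebook presumably visualises the real counts instead):

```python
from collections import Counter

positive_tweets = ["love day", "nice day love", "love sun"]
freq = Counter(word for tweet in positive_tweets for word in tweet.split())
print(freq.most_common(2))  # [('love', 3), ('day', 2)]
```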

Data Split

We will shuffle the dataset and split it into train, validation and test datasets, in the ratio 6:2:2. It's important to shuffle the dataset before training.
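The shuffle and 6:2:2 split can be sketched with plain pandas slicing (a toy frame stands in for the real data; a pair of sklearn train_test_split calls would work equally well):

```python
import pandas as pd

df = pd.DataFrame({"target": [0, 1] * 50, "text": ["tweet"] * 100})

df = df.sample(frac=1, random_state=42).reset_index(drop=True)  # shuffle
n = len(df)
train = df.iloc[: int(0.6 * n)]               # 60% train
val = df.iloc[int(0.6 * n): int(0.8 * n)]     # 20% validation
test = df.iloc[int(0.8 * n):]                 # 20% test
print(train.shape, val.shape, test.shape)     # (60, 2) (20, 2) (20, 2)
```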

Train: (960000, 2), Validation: (320000, 2), Test: (320000, 2)
TRAIN DATA
target text
631903 0 idk hope soon miss guy
151923 0 oohh oh presented really dont wanna sit pretty...
205188 0 going work sad day
1470937 1 sorry fountain raped bet feel violated going b...
695036 0 now hour till work sad panda

Pre-trained Embedding model

Understanding the difference

There are two types of embeddings in the NLP domain:

WORD EMBEDDING

  • Baseline
    1. Word2Vec
    2. GloVe
    3. FastText
  • State-of-the-art
    1. ELMo (Embeddings from Language Models)
    2. BERT (Bidirectional Encoder Representations from Transformers)
    3. OpenAI GPT (Generative Pre-Training Transformer)
    4. ULMFiT (Universal Language Model Fine-Tuning) - This is more of a process that includes word embedding along with NN architecture.

SENTENCE EMBEDDING

  • Baseline
    1. Bag of Words
    2. Doc2Vec
  • State-of-the-art
    1. Sentence BERT
    2. Skip-Thoughts and Quick-Thoughts
    3. InferSent
    4. Universal Sentence Encoder

So, the fundamental difference is that a word embedding turns a single word into an N-dimensional vector, whereas a sentence embedding is more powerful because it can embed not only words but phrases and whole sentences as well.

ULMFiT is considered the best choice for transfer learning in NLP, but it is built on the fast.ai library, whose code implementation differs from that of Keras or TensorFlow. Hence, for this notebook we will be using the Universal Sentence Encoder.

Universal Sentence Encoder

It can be used for text classification, semantic similarity, clustering and other natural language tasks. The model is trained and optimized for greater-than-word length text, such as sentences, phrases or short paragraphs.

It takes variable-length English text as input and outputs a 512-dimensional vector. Handling variable-length text input sounds great, but there is a catch: the longer the sentence (counted in words), the more diluted the resulting embedding can become.

Hence, there are two Universal Sentence Encoder variants to choose from, with different encoder architectures for distinct design goals:

  • Transformer architecture that targets high accuracy at the cost of greater model complexity and resource consumption
  • Deep Averaging Network (DAN) that targets efficient inference, with slightly reduced accuracy, using a simple architecture

Both models were trained with the Stanford Natural Language Inference (SNLI) corpus. The SNLI corpus is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral, supporting the task of natural language inference (NLI), also known as recognizing textual entailment (RTE). Essentially, the models were trained to learn the semantic similarity between the sentence pairs.

This model is trained using DAN. DAN works in three simple steps:

  1. take the vector average of the embeddings associated with an input sequence of tokens
  2. pass that average through one or more feedforward layers
  3. perform (linear) classification on the final layer’s representation

The primary advantage of the DAN encoder is that compute time is linear in the length of the input sequence.
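The three DAN steps above can be sketched in a few lines of NumPy. Toy dimensions and random weights are assumptions, purely to illustrate the data flow; the real encoder's weights and 512-dimensional output come from pretraining:

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = "this movie was great".split()
emb = {t: rng.normal(size=8) for t in tokens}     # toy 8-dim token embeddings

avg = np.mean([emb[t] for t in tokens], axis=0)   # 1. average the embeddings
W1, b1 = rng.normal(size=(8, 16)), np.zeros(16)
h = np.tanh(avg @ W1 + b1)                        # 2. feedforward layer(s)
W2, b2 = rng.normal(size=(16, 2)), np.zeros(2)
logits = h @ W2 + b2                              # 3. linear classification
print(logits.shape)  # (2,)
```

Since the only per-token work is one vector addition for the average, compute time grows linearly with the number of input tokens, which is the advantage noted above.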

This module is about 1GB. Depending on your network speed, it might take a while to load the first time you run inference with it. After that, loading the model should be faster as modules are cached by default.

Embedding size: 512

We have loaded the Universal Sentence Encoder, and computing the embeddings for some text is as easy as shown below.

Message: Elephant
Embedding size: 512
Embedding: [-0.016987269744277, -0.00894981063902378, -0.007062734104692936, ...]

Message: I am a sentence for which I would like to get its embedding.
Embedding size: 512
Embedding: [0.035313352942466736, -0.025384267792105675, -0.00788002647459507, ...]

Message: Universal Sentence Encoder embeddings also support short paragraphs. There is no hard limit on how long the paragraph is. Roughly, the longer the more 'diluted' the embedding will be.
Embedding size: 512
Embedding: [0.01879092864692211, 0.04536517709493637, -0.020010894164443016, ...]

Model creation

We have loaded the Universal Sentence Encoder as the variable embed. To make it work with Keras, it is necessary to wrap it in a Keras Lambda layer and explicitly cast its input to a string. Then we build the model with Keras' standard Functional API. The model summary shows that only the Keras layers are trainable - that is how the transfer learning task works here, leaving the Universal Sentence Encoder weights untouched.

Now, let's eliminate the confusion between two terms used in deep learning - the loss function and the optimizer.

The loss function is a mathematical way of measuring how wrong the predictions are.

During the training process, we tweak and change the parameters (weights) of the model to try and minimize that loss function, and make the predictions as correct and optimized as possible. But how exactly is it done, by how much, and when?

This is where optimizers come in. They tie together the loss function and model parameters by updating the model in response to the output of the loss function.
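A toy illustration of that interplay, in pure Python: squared-error loss and plain gradient descent stand in for the notebook's actual loss/optimizer choice, but the loop structure is the same idea:

```python
# Fit w so that w * x ≈ y; the loss measures how wrong the prediction is,
# and the optimizer (plain gradient descent here) updates w to reduce it.
w, x, y, lr = 0.0, 2.0, 4.0, 0.1
for _ in range(50):
    pred = w * x
    loss = (pred - y) ** 2      # loss function: squared error
    grad = 2 * (pred - y) * x   # gradient of the loss w.r.t. w
    w -= lr * grad              # optimizer step
print(round(w, 4))  # 2.0, the weight that drives the loss to zero
```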

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 1)                 0         
_________________________________________________________________
lambda_1 (Lambda)            (None, 512)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 256)               131328    
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 64)                16448     
_________________________________________________________________
dropout_2 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 2)                 130       
=================================================================
Total params: 147,906
Trainable params: 147,906
Non-trainable params: 0
_________________________________________________________________
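The parameter counts in the summary above can be verified by hand: a Dense layer with n_in inputs and n_out units has n_in*n_out weights plus n_out biases, while the input and frozen Lambda (encoder) layers contribute none:

```python
def dense_params(n_in, n_out):
    # weight matrix (n_in x n_out) plus one bias per output unit
    return n_in * n_out + n_out

counts = [dense_params(512, 256), dense_params(256, 64), dense_params(64, 2)]
print(counts, sum(counts))  # [131328, 16448, 130] 147906
```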

Training

Now, we train the model with the training dataset and validate its performance at the end of each training epoch with validation dataset.

Train on 960000 samples, validate on 320000 samples
Epoch 1/5
960000/960000 [==============================] - 144s 149us/step - loss: 0.4850 - accuracy: 0.7639 - val_loss: 0.4738 - val_accuracy: 0.7711
Epoch 2/5
960000/960000 [==============================] - 143s 149us/step - loss: 0.4726 - accuracy: 0.7717 - val_loss: 0.4673 - val_accuracy: 0.7752
Epoch 3/5
960000/960000 [==============================] - 144s 150us/step - loss: 0.4674 - accuracy: 0.7754 - val_loss: 0.4646 - val_accuracy: 0.7764
Epoch 4/5
960000/960000 [==============================] - 160s 167us/step - loss: 0.4639 - accuracy: 0.7776 - val_loss: 0.4628 - val_accuracy: 0.7777
Epoch 5/5
960000/960000 [==============================] - 154s 160us/step - loss: 0.4612 - accuracy: 0.7791 - val_loss: 0.4621 - val_accuracy: 0.7785

Evaluation

Now that we have trained the model, we can evaluate its performance. We will use a few evaluation metrics and techniques to test the model.

Train Accuracy: 0.7855, Test Accuracy: 0.7772

The Learning Curve of loss and accuracy of the model on each epoch are shown as below:

Prediction

Finally, let's perform some predictions to see where and why the model misclassifies.

target text predicted
423508 0 fever feel horrible brought idk help 0
253748 0 phone screen cracked 0
1474624 1 nice way start middle day 1
1592279 1 saw raw 11 wake now little summer ill sleep later 0
223725 0 tired excited for tonight pooped 1
46522 0 coding solution end not practical coder life not wine rose 0
849904 1 im really tired im off bed for tonight byee everybody 0
1373626 1 got facebook weird twitter kinda rule 1
1227719 1 looking forward day 1
427189 0 feel gross going watch one tree hill bed 0
821075 1 thank 1
615311 0 oh fun moving new office soon 1
569986 0 damn no light gun action dell sux 0
573211 0 well work for construction co work limited office hour getting cut back 0
1512776 1 four scheduled post every day new post until next week 1
278171 0 friend rock sock never leave 0
985052 1 living dream quotation mark forgotten inserted 0
336575 0 last ep for gg sad but finally damn blair amp chuck got together find out gossip girl emojiwink 0
464610 0 bad idea wear heel ouch 0
1436720 1 10 dollar make show 1

Now, we will perform prediction on the clean test data set provided along with the train data.

TEST DATA
target id date flag user text
0 4 3 Mon May 11 03:17:40 UTC 2009 kindle2 tpryan @stellargirl I loooooooovvvvvveee my Kindle2. ...
1 4 4 Mon May 11 03:18:03 UTC 2009 kindle2 vcu451 Reading my kindle2... Love it... Lee childs i...
2 4 5 Mon May 11 03:18:54 UTC 2009 kindle2 chadfu Ok, first assesment of the #kindle2 ...it fuck...
3 4 6 Mon May 11 03:19:04 UTC 2009 kindle2 SIX15 @kenburbary You'll love your Kindle2. I've had...
4 4 7 Mon May 11 03:21:41 UTC 2009 kindle2 yamarama @mikefish Fair enough. But i have the Kindle2...
5 4 8 Mon May 11 03:22:00 UTC 2009 kindle2 GeorgeVHulme @richardebaker no. it is too big. I'm quite ha...
6 0 9 Mon May 11 03:22:30 UTC 2009 aig Seth937 Fuck this economy. I hate aig and their non lo...
7 4 10 Mon May 11 03:26:10 UTC 2009 jquery dcostalis Jquery is my new best friend.
8 4 11 Mon May 11 03:27:15 UTC 2009 twitter PJ_King Loves twitter
9 4 12 Mon May 11 03:29:20 UTC 2009 obama mandanicole how can you not love Obama? he makes jokes abo...

Number of categories: 3

We see 3 categories instead of 2: the extra sentiment is neutral, which the model hasn't been trained on. Even if we kept it, one-hot encoding would produce 3 columns, conflicting with the model's output layer (which has 2). So we have to discard it.
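Discarding the neutral rows (target == 2) and mapping 4 to 1, as was done for the training data, is a two-line pandas filter; a sketch on a toy frame:

```python
import pandas as pd

test_df = pd.DataFrame({"target": [4, 0, 2, 4, 2], "text": list("abcde")})
test_df = test_df[test_df["target"] != 2].copy()       # drop neutral tweets
test_df["target"] = test_df["target"].replace(4, 1)    # align labels with training
print(sorted(test_df["target"].unique()), len(test_df))  # [0, 1] 3
```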

Number of categories: 2

Now, we will evaluate and predict the data with and without all the text preprocessing, and analyze the difference.

New Test Data size: 359

New test data evaluation

Without preprocessing: 0.8552 || With preprocessing: 0.8607

Not much of a significant difference. Let's look at some of the outputs to understand where the difference comes from.

target text processed_text predicted processed_predicted
0 1 @stellargirl I loooooooovvvvvveee my Kindle2. Not that the DX is cool, but the 2 is fantastic in its own right. loovvee kindle2 not dx cool but fantastic right 1 1
1 1 Reading my kindle2... Love it... Lee childs is good read. reading kindle2 love lee child good read 1 1
2 1 Ok, first assesment of the #kindle2 ...it fucking rocks!!! ok first assesment kindle2 fucking rock 1 1
3 1 @kenburbary You'll love your Kindle2. I've had mine for a few months and never looked back. The new big one is huge! No need for remorse! :) love kindle2 mine for month never looked back new big one huge no need for remorse emojismile 1 1
4 1 @mikefish Fair enough. But i have the Kindle2 and I think it's perfect :) fair enough but kindle2 think perfect emojismile 1 1
5 1 @richardebaker no. it is too big. I'm quite happy with the Kindle2. no big quite happy with kindle2 1 1
6 0 Fuck this economy. I hate aig and their non loan given asses. fuck economy hate aig non loan given ass 0 0
7 1 Jquery is my new best friend. jquery new best friend 1 0
8 1 Loves twitter love twitter 1 1
9 1 how can you not love Obama? he makes jokes about himself. not love obama make joke 0 1
11 0 @Karoli I firmly believe that Obama/Pelosi have ZERO desire to be civil. It's a charade and a slogan, but they want to destroy conservatism firmly believe obama pelosi zero desire civil charade slogan but want destroy conservatism 1 0
12 1 House Correspondents dinner was last night whoopi, barbara & sherri went, Obama got a standing ovation house correspondent dinner last night whoopi barbara amp sherri went obama got standing ovation 0 0
13 1 Watchin Espn..Jus seen this new Nike Commerical with a Puppet Lebron..sh*t was hilarious...LMAO!!! watchin espn jus seen new nike commerical with puppet lebron sh hilarious lmao 1 1
14 0 dear nike, stop with the flywire. that shit is a waste of science. and ugly. love, @vincentx24x dear nike stop with flywire shit waste science ugly love 0 0
15 1 #lebron best athlete of our generation, if not all time (basketball related) I don't want to get into inter-sport debates about __1/2 lebron best athlete generation not time basketball related don want get inter sport debate 1 0
16 0 I was talking to this guy last night and he was telling me that he is a die hard Spurs fan. He also told me that he hates LeBron James. talking guy last night telling die hard spur fan also told hate lebron james 0 0
17 1 i love lebron. http://bit.ly/PdHur love lebron 1 1
18 0 @ludajuice Lebron is a Beast, but I'm still cheering 4 the A..til the end. lebron beast but still cheering til end 1 1
19 1 @Pmillzz lebron IS THE BOSS lebron bos 1 1
20 1 @sketchbug Lebron is a hometown hero to me, lol I love the Lakers but let's go Cavs, lol lebron hometown hero lol love lakers but let go cavs lol 1 0

Executive Summary

The objective of this notebook is to analyze and classify the sentiment of Tweets obtained from Twitter as positive or negative.

After identifying the relevant columns, we performed intensive text preprocessing so that the text could be fed directly to the model, without even needing tokenization - a step required in the traditional deep learning approach.

Using Universal Sentence Encoder, which is a state-of-the-art pre-trained sentence embedding module, we contextualized the tweets and created a model that holds the information as to which tweets are referring to a positive sentiment, and which ones are negative.

An interesting observation: despite the Encoder variant's architecture yielding lower accuracy, the dataset being imperfectly tagged, and the small number of NN layers, we obtain a pretty decent accuracy.